Conference Proceedings
Open source corpus analysis tools for Malay
T Baldwin, S Awab
Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006) | Published: 2006
Abstract
Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K word sample of Malay text.
Grants
Awarded by Australian Research Council